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HIGH-DIMENSIONAL VARIABLE SELECTION 

By Larry Wasserman and Kathryn Roeder 1 
Carnegie Mellon University 

This paper explores the following question: what kind of statis- 
tical guarantees can be given when doing variable selection in high- 
dimensional models? In particular, we look at the error rates and 
power of some multi-stage regression methods. In the first stage we 
fit a set of candidate models. In the second stage we select one model 
by cross-validation. In the third stage we use hypothesis testing to 
eliminate some variables. We refer to the first two stages as "screen- 
ing" and the last stage as "cleaning." We consider three screening 
methods: the lasso, marginal regression, and forward stepwise regres- 
sion. Our method gives consistent variable selection under certain 
conditions. 

1. Introduction. Several methods have been developed lately for high- 
dimensional linear regression such as the lasso [Tibshirani (1996)], Lars 
[Efron et al. (2004)] and boosting [Biihlmann (2006)]. There are at least 
two different goals when using these methods. The first is to find models 
with good prediction error. The second is to estimate the true "sparsity 
pattern," that is, the set of covariates with nonzero regression coefficients. 
These goals are quite different and this paper will deal with the second goal. 
(Some discussion of prediction is in the Appendix.) Other papers on this 
topic include Meinshausen and Biihlmann (2006), Candes and Tao (2007), 
Wainwright (2006), Zhao and Yu (2006), Zou (2006), Fan and Lv (2008), 
Meinshausen and Yu (2008), Tropp (2004, 2006), Donoho (2006) and Zhang 
and Huang (2006). In particular, the current paper builds on ideas in Mein- 
shausen and Yu (2008) and Meinshausen (2007). 

Let (Xi,Y\), . . . , (X n ,Y n ) be i.i.d. observations from the regression model 

(1) Yi = Xfp + £i , 
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where e ~ N(0, a 2 ), X { = (Xn , . . . , X ip ) T G W and p = p n > n. Let X be 
the n x p design matrix with jth column X,j = (Xij, . . . ,X n j) T and let 
Y = (Y u ...,Y n ) T . Let 

D = {j:f3 j ^0} 

be the set of covariates with nonzero regression coefficients. Without loss 
of generality, assume that D = {l,...,s} for some s. A variable selection 
procedure D n maps the data into subsets of 5 = {1, . . . ,p}. 

The main goal of this paper is to derive a procedure D n such that 

(2) limsupP(l) n C-D)>l-a, 

n— *oo 

that is, the asymptotic type I error is no more than a. Note that throughout 
the paper we use C to denote nonstrict set-inclusion. Moreover, we want 
D n to have nontrivial power. Meinshausen and Biihlmann (2006) control a 
different error measure. Their method guarantees limsup n ^ 00 P(I) ri n V ^ 
0) < q where V is the set of variables not connected to Y by any path in 
an undirected graph. 

Our procedure involves three stages. In stage I we fit a suite of candidate 
models, each model depending on a tuning parameter A, 

S = {S n (\):\£A}. 

In stage II we select one of those models S n using cross-validation to select A. 
In stage III we eliminate some variables by hypothesis testing. Schematically, 

. stage I .-, stage II ^ stage III 

data — > 5 — > y b n — > D n . 

screen ~~ de^r - 

Genetic epidemiology provides a natural setting for applying screen and 
clean. Typically, the number of subjects, n, is in the thousands, while p 
ranges from tens of thousands to hundereds of thousands of genetic fea- 
tures. The number of genes exhibiting a detectable association with a trait 
is extremely small. Indeed, for type I diabetes only ten genes have exhibited 
a reproducible signal [Wellcome Trust (2007)]. Hence, it is natural to assume 
that the true model is sparse. A common experimental design involves a 2- 
stage sampling of data, with stages 1 and 2 corresponding to the screening 
and cleaning processes, respectively. 

In stage 1 of a genetic association study, n\ subjects are sampled and one 
or more traits such as bone mineral density are recorded. Each subject is 
also measured at p locations on the chromosomes. These genetic covariates 
usually have two forms in the population due to variability at a single nu- 
cleotide and hence are called single nucleotide polymorphisms (SNPs). The 
distinct forms are called alleles. Each covariate takes on a value (0, 1 or 2) 
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indicating the number of copies of the less common allele observed. For a 
well-designed genetic study, individual SNPs are nearly uncorrelated unless 
they are physically located in very close proximity. This feature makes it 
much easier to draw causal inferences about the relationship between SNPs 
and quantitative traits. It is standard in the field to infer that an association 
discovered between a SNP and a quantitative trait implies a causal genetic 
variant is physically located near the one exhibiting association. In stage 2, 
ri2 subjects are sampled at a subset of the SNPs assessed in stage 1. SNPs 
measured in stage 2 are often those that achieved a test statistic that ex- 
ceeded a predetermined threshold of significance in stage 1. In essence, the 
two stage design pairs naturally with a screen and clean procedure. 

For the screen and clean procedure, it is essential that S n has two prop- 
erties as n — > oo as follows: 

(3) P(D cS n )^l 
and 

(4) \S n \=o P (n), 

where \M\ denotes the number of elements in a set M. Condition (3) ensures 
the validity of the test in stage III while condition (4) ensures that the power 
of the test is not too small. Without condition (3), the hypothesis test in 
stage III would be biased. We will see that the power goes to 1, so taking 
ck = cn n — ► implies consistency: P(-D n = D) — ► 1. For fixed a, the method 
also produces a confidence sandwich for D, namely, 

liminf F(D n C D C S n ) > 1 - a. 

n— >oo 

To fit the suite of candidate models, we consider three methods. In method 

1, 

S n (\) = {j:P j (\)^0}, 
where (3j{\) is the lasso estimator, the value of (3 that minimizes 

n p 

i=i j=i 

In method 2, take S n (X) to be the set of variables chosen by forward stepwise 
regression after A steps. In method 3, marginal regression, we take 

Sn = {j-\fij\ > A}, 

where /L is the marginal regression coefficient from regressing Y on Xj . (This 
is equivalent to ordering by the absolute t-statistics since we will assume that 
the covariates are standardized.) These three methods are very similar to 
basis pursuit, orthogonal matching pursuit and thresholding [see, e.g., Tropp 
(2004, 2006) and Donoho (2006)]. 
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Notation. Let ip = min, g £) \j3j\. Define the loss of any estimator (3 by 

(5) L0) = -0- (3) T X T X0 -P) = 0- P) T %0 - /?), 

n 

where S n = n~ 1 X T X. For convenience, when (5 = /3(A) depends on A we 
write L(X) instead of L(/3(A)). If McS, let Xm be the design matrix 
with columns (X,j :j 6 M) and let /?m = (X m Xm)~ x X M Y denote the least- 
squares estimator, assuming it is well defined. Note that our use of X,a differs 

from standard AN OVA notation. Write X\ instead of Xm when M = S n (X). 
When convenient, we extend (3 m to length p by setting 0m (j) = 0, for j ^ M. 
We use the norms 

\\ v \\ = \'^^ v 'j> IMIl = \ v j\ an( ^ ll w lloo =max|«j-|. 
V 3 3 j 

If C is any square matrix, let <j)(C) and 3>(C) denote the smallest and 
largest eigenvalues of C. Also, if k is an integer, define 

4> n {k)= min <f>( -X m Xm] and $ n (k) = max $(-X m Xm\ 

M:\M\=k \n J M:\M\=k \n J 

We will write z u for the upper quantile of a standard normal, so that P(Z > 
z u ) = u where Z ~ N(0, 1). 

Our method will involve splitting the data randomly into three groups 
T>i, T>2 and T>^. For ease of notation, assume the total sample size is 3n and 
that the sample size of each group is n. 

Summary of assumptions. We will use the following assumptions through- 
out except in Section 8: 

(Al) Yi = Xf(3 + Ei where e 4 ~ N(0, a 2 ), for i = 1, . . . , n. 

(A2) The dimension p n of X satisfies p n — > oo and p n < c\e n 2 , for some 
ci > and < c 2 < 1. 

(A3) s = \{j : (3j / 0}| = O(l) and V = minjl/3,1 : / 0} > 0. 

(A4) There exist positive constants Cq,Ci and k such that 
P(limsup n _ >00 3> n (n) < Co) = 1 and P(liminf n ^ 00 4> n (Ci logn) > k) = 1. Also, 
P(</> n (n) > 0) = 1, for all n. 

(A5) The covariates are standardized: E(Xy) = and E(X£) = 1. Also, 
there exists < B < oo such that P(|X jfc | < B) = 1. 

For simplicity, we include no intercepts in the regressions. The assump- 
tions can be weakened at the expense of more complicated proofs. In par- 
ticular, we can let s increase with n and tp decrease with n. Similarly, the 
normality and constant variance assumptions can be relaxed. 



VARIABLE SELECTION 



5 



2. Error control. Define the type I error rate q{D n ) = ¥(D n n D c 7^ 
0) and the asymptotic error rate limsup n ^ 00 q(D n ). We define the power 
7r(D n ) = F(D C D n ) and the average power 

vr av = - £ H3 G AO- 

It is well known that controlling the error rate is difficult for at least 
three reasons: correlation of covariates, high-dimensionality of the covariate 
and unfaithfulness (cancellations of correlations due to confounding) . Let us 
briefly review these issues. 

It is easy to construct examples where q(D n ) < a implies that 7r(D n ) ~ a. 
Consider the following two models for random variables Z = (Y,Xi,X 2 )' 

Model 1 Model 2 

Xi~iV(0,l), X 2 ~iV(0,l), 

Y = ipX 1 + N (0,1), Y = 4>X 2 + N(0,1), 

X 2 = P X 1 + N(0,t 2 ). X 1 = P X 2 + N(0,t 2 ). 

Under models 1 and 2, the marginal distribution of Z is Pi = N(0, £1) 
and P 2 = N(0, £2), where 

/ ip 2 + 1 ip pip \ ( ip 2 + 1 pip ip\ 

£1= ip 1 P , £2= [pip P 2 + r 2 p . 

V ptp p p 2 + t 2 ) v v p 1/ 

Given any e > 0, we can choose p sufficiently close to 1 and r sufficiently close 
to such that £1 and £2 are as close as we like, and hence, d(Pi,P 2 ) < e 
where d is total variation distance. It follows that 

P 2 (2 i D) > Pi (2 i D) - e > 1 - a - e. 

Thus, if q < a, then the power is less than a + e. 

Dimensionality is less of an issue thanks to recent methods. Most methods, 
including those in this paper, allow p n to grow exponentially. But all the 
methods require some restrictions on the number s of nonzero /3/s. In other 
words, some sparsity assumption is required. In this paper we take s fixed 
and allow p n to grow. 

False negatives can occur during screening due to cancellations of cor- 
relations. For example, the correlation between Y and X\ can be even 
when j3\ is huge. This problem is called unfaithfulness in the causality liter- 
ature [see Spirtes, Glymour and Schemes (2001) and Robins et al. (2003)]. 
False negatives during screening can lead to false positives during the second 
stage. 
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Let jUj denote the regression coefficient from regressing Y on Xj . Fix j < s 
and note that 

fj,j = E(j2j) =Pj+ PkPkj, 

l<k<s 

where pkj = coiv(Xk,Xj). If 

J2 Pkpkj~-f3j, 

l<k<s 

then Hj « no matter how large (3j is. This problem can occur even when 
n is large and p is small. 

For example, suppose that (3 = (10, —10,0,0) and that p(Xi,Xj) = ex- 
cept that p(X\,X2) = p(Xi,Xs) = p{X2, X4) = 1 — e, where e > is small. 
Then, 

= (10,-10,0,0), but jtra (0,0,10,-10). 

Marginal regression is extremely susceptible to unfaithfulness. The lasso 
and forward stepwise, less so. However, unobserved covariates can induce 
unfaithfulness in all the methods. 



3. Loss and cross-validation. Let X\ = (X.j :j e S n (\)) denote the de- 
sign matrix corresponding to the covariates in S n (\) and let (3(\) be the 
least-squares estimator for the regression restricted to S n (X), assuming the 
estimator is well defined. Hence, /3(A) = (XjX A )" 1 Xjy. More generally, 
/3m is the least-squares estimator for any subset of variables M. When con- 
venient, we extend /3(A) to length p by setting (3j(\) = 0, for j ^ S n (X). 

3.1. Loss. Now we record some properties of the loss function. The first 
part of the following lemma is essentially Lemma 3 of Meinshausen and Yu 
(2008). 

Lemma 3.1. Let M+= {M C S :\M\ <m,D C M}. Then, 

(6) P( sup L0 M )<*£™)^1. 

Let M~ = {M C S : \M\ < m, D<£M}. Then, 

(7) pf inf L0 M )>^ 2 Mm + s))^l. 

\MeM~ / 
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3.2. Cross-validation. Recall that the data have been split into groups 
T>i, T>2 and D3 each of size n. Construct /3(A) from T>\ and let 

(8) L(A) = - E fr-Xlmf- 

We would like L(X) to order the models the same way as the true loss 
L(A) [defined after (5)]. This requires that, asymptotically, L(X) — L(X) ps 6 n , 
where 5 n does not involve A. The following bounds will be useful. Note that 
L(A) and L(X) are both step functions that only change value when a variable 
enters or leaves the model. 

Theorem 3.2. Suppose that max^gAn |Sn(A)| < k n . Then, there exists 
a sequence of random variables 5 n = Op(l) that do not depend on X or X, 
such that, with probability tending to 1, 

(9) sup |L(A) - L(X) -S n \ = P (-!£-) + Op . 

AeA n \ n / \v n / 

4. Multi-stage methods. The multi-stage methods use the following steps. 
As mentioned earlier, we randomly split the data into three parts, T>\, T>2 
and P3, which we take to be of equal size: 

1. Stage I. Use T>\ to find S n (X), for each A. 

2. Stage II. Use T>2 to find A by cross-validation, and let S n = S n (X). 

3. Stage III. Use P3 to find the least-squares estimate ft for the model S n . 
Let 

D n = {j G S n : \Tj \ > c n }, 
where Tj is the usual t-statistic, c n = z a /2 m and m= \S n \. 

4.1. The lasso. The lasso estimator [Tibshirani (1996)] /3(A) minimizes 

n p 

M x (X) = Y,(Y i -X?ft) 2 + X'£\ft j \ 
i=\ j=i 

and let S n (X) = {j : ftj(X) ^ 0}. Recall that /3(A) is the least-squares estima- 
tor using the covariates in S n (X). 

Let k n = Alogn where A > is a positive constant. 

Theorem 4.1. Assume that (A1)-(A5) hold. Let A n = {X : \S n (X)\ < 
kn}. Then: 

1. The true loss overfits: F(D C S n (X*)) — ► 1 where A* = argmin^gA^ ^(A). 



8 



L. WASSERMAN AND K. ROEDER 



2. Cross-validation also overfits: ¥(D C S n (X)) — > 1 where X = argmin\ e A„ L(X). 

3. Type I error is controlled: limsup n ^ QO F(D c nfl n /0) < a. 

If we let a = a n — > 0, then D n is consistent for variable selection. 

Theorem 4.2. Assume that (Al)-(A5) hold. Let a n ^0 and \pna n — ► 

00. Then, the multi-stage lasso is consistent, 

(10) P(A, = .D)-1. 

The next result follows directly. The proof is thus omitted. 

Theorem 4.3. Assume that (A1)-(A5) hold. Let a be fixed. Then, 
(D n ,S n ) forms a confidence sandwich 

(11) liminf P(5 n C D C S n ) > 1 - a. 

n— >oo 

Remark 4.4. This confidence sandwich is expected to be conservative 
in the sense that the coverage can be much larger than 1 — a. 

4.2. Stepwise regression. Let k n = Alogra for some A > 0. The version 
of stepwise regression we consider is as follows: 

1. Initialize: Res = Y, A = 0, Y = and S n (\) = 0. 

2. Let A <— A + 1. Compute J2j = n~ 1 (Xj, Res) for j = 1, . . . ,p. 

3. Let J = argmaxj \p,j\. Set S n (X) = {S n (X — 1), J}. Set Y = X\(3(X) where 
fix = {,XlXxY 1 X x v Y, and let Res = Y — Y. 

4. If A = k n , stop. Otherwise, go to step 2. 

For technical reasons, we assume that the final estimator x T (5 is truncated 
to be no larger than B. Note that A is discrete and A n = {0, 1, . . . , k n }. 

Theorem 4.5. With S n (X) defined as above, the statements of Theorems 
4.1, 4.2 and 4.3 hold. 

4.3. Marginal regression. This is probably the oldest, simplest and most 
common method. It is quite popular in gene expression analysis. It used to 
be regarded with some derision but has enjoyed a revival. A version appears 
in a recent paper by Fan and Lv (2008). Let S n (X) = {j : \Jlj\ > X} where 
$ j = n- 1 (Y,X.j). 

Let Hj = K(fij), and let fjL(j\ denote the value of fi ordered by their absolute 
values, 

|/i(l)| > 1^(2)1 > 
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Theorem 4.6. Let k n — > oo with k n = o(^/n). Let A n = {A:|S n (A)| < 
k n }. Assume that 

(12) ™gM > lM(fc„)l- 

Then, the statements of Theorems 4-1, 4-% an d 4-3 hold. 



Assumption (12) limits the degree of unfaithfulness (small partial corre- 
lations induced by cancellation of parameters). Large values of k n weaken 
assumption (12), thus making the method more robust to unfaithfulness, but 
at the expense of lower power. Fan and Lv (2008) make similar assumptions. 
They assume that there is a C > such that \fj,j\ > C\8A for all j, which 
rules out unfaithfulness. However, they do not explicitly relate the values of 
Hj for j G D to the values outside D as we have done. On the other hand, 
they assume that Z = S _1 / 2 X has a spherically symmetric distribution. Un- 
der this assumption and their faithfulness assumption, they deduce that the 
Hj's outside D cannot strongly dominate the /ij's within D. We prefer to 
simply make this an explicit assumption without placing distributional as- 
sumptions on X. At any rate, any method that uses marginal regressions as 
a starting point must make some sort of faithfulness assumptions to succeed. 

4.4. Modifications. Let us now discuss a few modifications of the basic 
method. First, consider splitting the data only into two groups, T>\ and T>2- 
Then do the following steps: 

1. Stage I. Find S n (\) for A € A n , where |/S n (A)| < k n for each A G A ra using 

2. Stage II. Find A by cross-validation, and let S n = S n (X) using XV 

3. Stage III. Find the least-squares estimate 8-$ using T>2- Let D n = \j G 
S n :\Tj\ > c n }, where Tj is the usual t-statistic. 

Theorem 4.7. Choosing 

n o\ _ loglogn v /2fc n log(2p TO ) 

a 

controls asymptotic type I error. 



The critical value in (13) is hopelessly large and it does not appear it can 
be substantially reduced. We present this mainly to show the value of the 
extra data-splitting step. It is tempting to use the same critical value as in 
the tri-split case, namely c n = z a / 2m where m = \S n \, but we suspect this 
will not work in general. However, it may work under extra conditions. 
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5. Application. As an example, we illustrate an analysis based on part 
of the osteoporotic fractures in men study [MrOS, Orwoll et al. (2005)]. A 
sample of 860 men were measured at a large number of genes and outcome 
measures. We consider only 296 SNPs which span 30 candidate genes for 
bone mineral density. An aim of the study was to identify genes associated 
with bone mineral density that could help in understanding the genetic basis 
of osteoporosis in men. Initial analyses of this subset of the data revealed 
no SNPs with a clear pattern of association with the phenotype; however, 
three SNPs, numbered (67), (277) and (289), exhibited some association in 
the screening of the data. To further explore the effacacy of the lasso screen 
and clean procedure, we modified the phenotype to enhance this weak signal 
and then reanalyzed the data to see if we could detect this planted signal. 

We were interested in testing for main effects and pairwise interactions 
in these data; however, including all interactions results in a model with 
43,660 additional terms, which is not practical for this sample size. As a 
compromise, we selected 2 SNPs per gene to model potential interaction 
effects. This resulted in a model with a total of 2066 potential coefficients, 
including 296 main effects and 1770 interaction terms. With this model, our 
initial screen detected 10 terms, including the 3 enhanced signals, 2 other 
main effects and 5 interactions. After cleaning, the final model detected the 
3 enhanced signals and no other terms. 

6. Simulations. To further explore the screen and clean procedures, we 
conducted simulation experiments with four models. For each model Yi = 
Xj j3 + £i where the measurement errors, £i and e*j, are i.i.d. normal(0, 1) 
and the covariates X^'s are normal(0, 1) (except for model D). Models differ 
in how Yi is linked to Xj and the dependence structure of the Xj's. Models 
A, B and C explore scenarios with moderate and large p, while model D 
focuses on confounding and unfaithfullness, as follows: 

(A) Null model: (3 = (0, . . . , 0) and the Xy's are i.i.d. 

(B) Triangle model: (3j = 5(10- = 1, . . . , 10, (3j = 0,j> 10 and Xy's are 
i.i.d. 

(C) Correlated Triangle model: as B, but with Xjj( +1 ) = pXy + (1 — 
for j > 1, and p = 0.5. 

(D) Unfaithful model: ^ = faXa + /3 2 X i2 + e u for ft = -(3 2 = 10, where 
the Ay's are i.i.d. for j = {1,5,6,7,8,9,10}, but Xj 2 = pXn + te* 2 , 
X i3 = pXn + T£* 10 , and X i4 = pX i2 + re* n , for r = 0.01 and p = 0.95. 

We used a maximum model size of k n = n 1//2 which technically goes be- 
yond the theory but works well in practice. Prior to analysis, the covariates 
are scaled so that each has mean and variance 1. The tests were initially 
performed using a third of the data for each of the 3 stages of the proce- 
dure (Table 1, top half, 3 splits). For models A, B and C, each approach 
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Table 1 

Size and power of screen and clean procedures using lasso, stepwise and marginal 
regression for the screening step. For all procedures a — 0.05. For p = 100, 8 = 0.5 and 

for p= 1000, S = 1.5. Reported power is 7r a v. The top 8 rows of simulations were 
conducted using three stages as described in Section 4, with a third of the data used for 
each stage. The bottom 8 rows of simulations were conducted splitting the data in half, 
using the first portion with leave-one-out cross validation for stages 1 and 2 and the 

second portion for cleaning 



Splits 


n 


V 


Model 




Size 






Power 




Lasso 


Step 


Marg 


Lasso 


Step 


Marg 


2 


100 


100 


A 


0.005 


0.001 


0.004 











2 


100 


100 


B 


0.01 


0.02 


0.03 


0.62 


0.62 


0.31 


2 


100 


100 


C 


0.001 


0.01 


0.01 


0.77 


0.57 


0.21 


2 


100 


10 


D 


0.291 


0.283 


0.143 


0.08 


0.08 


0.04 


2 


100 


1000 


A 


0.001 


0.002 


0.010 











2 


100 


1000 


B 


0.002 


0.020 


0.010 


0.17 


0.09 


0.11 


2 


100 


1000 


C 


0.02 


0.14 


0.01 


0.27 


0.15 


0.11 


2 


1000 


10 


D 


0.291 


0.283 


0.143 


0.08 


0.08 


0.04 


3 


100 


100 


A 


0.040 


0.050 


0.030 











3 


100 


100 


B 


0.02 


0.01 


0.02 


0.91 


0.90 


0.56 


:{ 


100 


100 


C 


0.03 


0.04 


0.03 


0.91 


0.88 


0.41 


3 


100 


10 


D 


0.382 


0.343 


0.183 


0.16 


0.18 


0.09 


3 


100 


1000 


A 


0.035 


0.045 


0.040 











3 


100 


1000 


B 


0.045 


0.020 


0.035 


0.57 


0.66 


0.29 


3 


100 


1000 


C 


0.06 


0.070 


0.020 


0.74 


0.65 


0.19 


3 


1000 


10 


D 


0.481 


0.486 


0.187 


0.17 


0.17 


0.13 



has type I error less than a, except the stepwise procedure which has trou- 
ble with model C when n = p = 100. We also calculated the false positive 
rate and found it to be very low (about 10 -4 when p = 100 and 10 -5 when 
p = 1000) indicating that even when a type I error occurs, only a very small 
number of terms are included erroneously. The lasso screening procedure 
exhibited a slight power advantage over the stepwise procedure. Both meth- 
ods dominated the marginal approach. The Markov dependence structure 
in model C clearly challenged the marginal approach. For model D, none of 
the approaches controlled the type I error. 

To determine the sensitivity of the approach to using distinct data for each 
stage of the analysis, simulations were conducted screening on the first half 
of the data and cleaning on the second half (2 splits). The tuning parameter 
was selected using leave-one-out cross validation (Table 1, bottom half). As 
expected, this approach lead to a dramatic increase in the power of all the 
procedures. More surprising is the fact that the type I error was near a or 
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below for models A, B and C. Clearly this approach has advantages over 
data splitting and merits further investigation. 

A natural competitor to screen and clean procedure is a two-stage adap- 
tive lasso [Zou (2006)]. In our implementation, we split the data and used 
one half for each stage of the analysis. At stage one, leave-one-out cross val- 
idation lasso screens the data. In stage two, the adaptive lasso, with weights 
Wj = \ f3j\~ 1 , cleans the data. The tuning parameter for the lasso was again 
chosen using leave-one-out cross validation. Table 2 provides the size, power 
and false positive rate (FPR) for this procedure. Naturally, the adaptive 
lasso does not control the size of the test, but the FPR is small. The power 
of the test is greater than we found for our lasso screen and clean procedure, 
but this extra power comes at the cost of a much higher type I error rate. 

7. Proofs. Recall that if A is a square matrix, then <p(A) and ${A) 
denote the smallest and largest eigenvalues of A. Throughout the proofs we 
make use of the following fact. If v is a vector and A is a square matrix, 
then 

(14) (j){A)\\v\\ 2 <v T Av<<S>(A)\\v\\ 2 . 

We use the following standard tail bound: if Z ~ N(0, 1), then P(\Z\ >t)< 

use the following results about the lasso from Mein- 
shausen and Yu (2008). Their results are stated and proved for fixed X but, 
under the conditions (Al)-(A5), it is easy to see that their conditions hold 
with probability tending to one and so their results hold for random X as 
well. 

Theorem 7.1 [Meinshausen and Yu (2008)]. Let /3(A) be the lasso esti- 
mator. 



Table 2 

Size, power and false positive rate (FPR) of two-stage adaptive lasso procedure 



n 


p 


Model 


Size 


Power 


FPR 


100 


100 


A 


0.93 





0.032 


100 


100 


B 


0.84 


0.97 


0.034 


100 


100 


C 


0.81 


0.96 


0.031 


100 


10 


D 


0.67 


0.21 


0.114 


100 


1000 


A 


0.96 





0.004 


100 


1000 


B 


0.89 


0.65 


0.004 


100 


1000 


C 


0.76 


0.77 


0.002 


1000 


10 


D 


0.73 


0.24 


0.013 
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1. The squared error satisfies 

where m = \S n (X)\ and c> is a constant. 

2. The size of S n (X) satisfies 

r 2 Cn 2 



(16) 



SnW\ < 



X 2 



where r 2 =E(Y 2 ). 

Proof of Lemma 3.1. Let D c M and . 

L0m) = —^XMiXlfXM^Xjje < — 



n~ x X T M X M ). Then, 



1 



n 



72 



where Zj = n l / 2 X^-e. Conditional on X, Z^ ~ N(0,a 2 ) where a 2 = n 1 x 
Yh=\ Xfj- Let A 2 = maxi<j< Pn a 2 . By HoefTding's inequality, (A2) and (A5), 



F(E n ) -> 1 where = {A n < ^2}. So 
max > ^/41ogp n ) 



< 



< 



< 



E 



<o 



max \ZA > y/4\ogp n ,E n +P max |Zd > v / 41ogp n ,£'^ 

l<j<Pn / \l<7<Pn 



max > ^logpn.^n +P(££ 

l<j<Pn 



1^ 



A n max > v/41ogp„,£; n + o(l) 

l<j<Pn Oj 



max — — > v / 21ogp n + o(l) 

l<J<Pn Oj 



\Zi 



max > v /21ogp n X + o(l) 



1 



V21ogp n 



+ ofl) = ofl). 



But J2jeM^j — mmaxi< 3 -<p n Z| and (6) follows. 

Now we lower bound L((3m)- Let M be such that L> ^ M. Let A = 
{j'-PmU) / 0} U D. Then, |A| < m + s. Therefore, with probability tend- 
ing to 1, 

L(0 M ) = - P) T X T X((3 M -P) = ~ (J3m - 0) T XlX A M - 0) 
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> re (m + s)0 M - /3|| 2 = Mm + s)J20M(j) - P(j)f 

>(j) n {m + s) Y, (O-/3(i)) 2 >0n(m + s)^ 2 . 

j£DnM c n 

Proof of Theorem 3.2. Let Y denote theresponses, and X the design 
matrix, for the second half of the data. Then, Y = X(3 + e. Now 



and 



L(A) = -(/3(A) - P) T X T X0(\) - /3) = (/3(A) - /3) T £„03(A) - /3) 
n 



L(A) = n~ l \\Y - Xf3{\)\\ 2 = (/3(A) - /3) T S n (^(A) - /3) + <5 n + - (e, X(0(A) - /3)) , 

n 

where 5 n = ||ej| 2 /n, and S n = n^X 1 X and S n = n~ 1 X T X . By Hoeffding's 
inequality 

P(|S n (j,fc)-S n (i,fc)|> e )<e-" C£2 
for some c > 0, and so 

p(max|E„0",fc) - E„0",fc)| >e) <p 2 e" nc£2 . 

Choose e n = 4/(cn 1_C2 ). It follows that 

pfmax|E n (j,fe)-S n (j,A;)| > < e" 2 ^ 2 - 0. 

Note that 

IO":^-(A)^0}U{j: J 9 i ^0}|<fc n + s. 
Hence, with probability tending to 1 

| L (A) - L(A) - <J n | < -4^r 2 \\m ~ &fx + 2^(A) 



for all As A re , where 



£n(A) = ~ y)ei/ii(A) 



and ia(\)=Xj (/3(A) -/3). Now ||/3(A) -/3||f = P ((A; n + s) 2 ) since 
P (k n /(f>(k n )). Thus, ||/3(A)-/3||i <C(/c n + s) with probability tending to 1, 
for someC>0. Also, < £||/3(A) - /3||i <BC(fc n + s) with probability 

tending to 1. Let ~ N(0, 1). Conditional on V 1 , 



n 

|£ n (A)| ^ £^(A)|W| < ^5C(fc n + 
v n \ i=l v n 
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so sup AgAn |£„(A)| =0 P {k n /y/n). □ 



Proof of Theorem 4.1. 1. Let A n = rn^C/kn, M = S n (X n ) and m = 
|M|. Then, P(m < fe„) -> 1 due to (16). Hence, P(A n G A n ) -> 1. From (15), 

P(A.) - < o(i-) + Op(^^) = 0P (1). 

Hence, ||/?(A n ) - = o P (l). So, for each j G D, 

|&(A„)| > |/%| - |^(A n ) ~Pj\>^ + P (1) 

and hence, P(minj e £) |/%(A n )| > 0) — ► 1. Therefore, r n = {A G A n : D C S n (X)} 
is nonempty. By Lemma 3.1, 

(17) L(X n ) < cm log p n /(n(f>(m)) = P {k n \ogp n /n). 
On the other hand, from Lemma 3.1, 

(18) F( inf L0 x )>^4>(k n )\ -1. 

Now, n<f> n {k n ) I [k n log p n ) — ► oo, and so (17) and (18) imply that 



Thus, if A* denotes the minimizer of L(X) over A n , we conclude that P(A* G 
r„) -► 1, and hence, P(£> C 5„(A»)) -> 1. 

2. This follows from part 1 and Theorem 3.2. 

3. Let A = S n Pi -D c . We want to show that 

pfmaxll)! >c n ^j <a + o{l). 

Now, 

P ^max \Tj\ > cj = P (max \Tj \ > c n ,D C S^j + P ^max \Tj \ > c n , D <f_ 

< P( max \TA > c n ,D C S n ) + F(D <£ S n ) 

< pfmax |Tj-| > Cn, D C S„J + o(l). 

Conditional on (T>i,T>2), Pa is normally distributed with mean and vari- 
ance matrix ct 2 (A"Ja^4) -1 when .D C Recall that 

r .( M ) _ e J( X M X M) -1 ^M^ _ ^Mj 



j 
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where M = S n , sj = a 2 e T j{X\ l Xu)~ x ^i and ej = (0, ... ,0,1,0, ... , 0) T , where 
the 1 is in the jih coordinate. When D C S n , each Tj, for j £ A, has a t- 
distribution with n — m degrees of freedom where m = \S n \. Also, c n / t a / 2m ~^ 
1 where t u denotes the upper tail critical value for the t-distribution. Hence, 

Pf max IT,- 1 > Cn,D C S n \Vi,V 2 



'^max|Tj| > t a/2m ,Dc S n \V 1 ,V 2 ) + a n 



<a + a n , 

where a n = o(l), since \A\ < m. It follows that 



M max \Tj\ > c n , D C S n I < a + o(l). 



PROOF of Theorem 4.2. From Theorem 4.1, P(-D n nD c /0)<a„ 
and so F(D n n D c / 0) — >■ 0. Hence, P(5 n Cf)-»1. It remains to be shown 
that 

(19) F(D C 5 n ) -> 1. 

The test statistic for testing /3j = when 5 n = M is 



e]{X T M X M )- l X T M Y 
aJef(XTX M )-i ej ' 



Tj(M) = 

For simplicity in the proof, let us take a = a, the extension to unknown a 
being straightforward. Let j <E D, M = {M : \M\ < k n ,D C M}. Then, 

F(j i D n ) = F(j iD n ,Dc S n ) + F(j iD n ,D£ S n ) 

<F(j<£D n ,DcS n )+F(D<£S n ) 

= F(j?D n ,DcS n ) + o(l) 

= nHDnA = M)+o(l) 

MeM 

< n\T j (M)\<c n ,S n = M) + o(l) 
MeM 

< J2 P(|T J (M)|< Cn ) + o(l). 
MeM 

Conditional on T> 1 UT> 2 , for each MeM, Tj(M) = (fa/s^ + Z, where Z~ 
N{0, 1). Without loss of generality, assume that (3j > 0. Hence, 

P(|Tj(M)| < c n \V 1 UV 2 )=F(-c n - ^ < Z < c n - ^-). 

V Sj SjJ 



VARIABLE SELECTION 



17 



Fix a small e > 0. Note that s 2 < a 2 /(ntt). It follows that, for all large n, 

C n - Pj/Sj < -Eyjn. So, 

¥(\Tj(M)\ < c n \V 1 UV 2 ) <P(Z < -e^) < e~ n£2 / 2 . 
The number of models in Ai is 

where we used the inequality 

k 
So, 

]T n\Tj(M)\ < c n \V 1 UV 2 ) < k nP k n "e- n£2 

by (A2). We have thus shown that F(j ^ D n ) — > 0, for each j G L>. Since |D| 
is finite, it follows that P(j ^ D n for some j e D) -> and hence (19). □ 

PROOF of Theorem 4.5. A simple modification of Theorem 3.1 of 
Barron et al. (2008) shows that 

L(k n ) = -\\Y kn -Xp\\ 2 = o P (l). 
n 

[The modification is needed because Barron et al. (2008) require Y to be 
bounded while we have assumed that Y is normal. By a truncation argument, 
we can still derive the bound on L(k n ).] So 

L(k n ) ^ L( 

4>n(k n +s) ~ K 

Hence, for any e > 0, with probability tending to 1, \\ f3{k n ) — f3\\ 2 < e so that 

\Pj\ > tp/2 > 0, for all j G D. Thus, F(D C S n (k n )) -> 1. The remainder of 
the proof of part 1 is the same as in Theorem 4.1. Part 2 follows from the 
previous result together with Theorem 3.2. The proof of part 3 is the same 
as for Theorem 4.1. □ 

PROOF of Theorem 4.6. Note that fij — fj,j = rT 1 Yh=i Xij £ i- Hence, 
Jlj — fij ~ -/V(0, 1/n). So, for any 5 > 0, 

p(niax|/X,- - Hj\ > d) < ^Pflju,- - > 5) 

< ^2_ e -ni=«/2 < cie nC2 c _ nS 2 /2 ^ Q _ 
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By (12), conclude that D C S n (X) when A = fi>(k n )- The remainder of the 
proof is the same as the proof of Theorem 4.5. □ 

Proof of Theorem 4.7. Let A = S„ n D c . We want to show that 



YmaxlTjl >c n ^j <a + o(l). 



For fixed A, (3a is normal with mean but this is not true for random A. 

\. Recall that 



Instead we need to bound T~. Recall that 



Tj(M) 



aJej(XlX M )-^ 



where M = S n , s] = S ;2 eJ(X^A" A /)~ 1 e i and ej = (0, . . . , 0, 1, 0, . . . , 0) T where 
the 1 is in the jih coordinate. The probabilities that follow are conditional 
on T>\ but this is supressed for notational convenience. First, write 



'( max \Tj \ > c n 

\j£A 



( max \Tj\ > c n , D C S n ) + P( max \Tj\ > c n ,D <£_ S n 



< Pf max \Tj\ > c n , D C S n ) + F(D £ S n 

Vie- 4 



< Pf max \Tj \ > Cn, D C S n ) + o(l). 



When DcS n , 



1 



-T 



-XI e = % + Qc 7c , 



where Q~ = {{l/n)Xl X^)" 1 , j 



-- n~ l XT e, and (j) = 0, for j G A. 

•~>n Oyi 



Now, sj > a 2 /{nC) so that 



\Tj\< 



< 



(7 



for j £ S n . Therefore, 

P( max IK I >c n ,DcSn) <P| max|/3~ .1 > ^flcSJ. 



Let 7 = n~ 1 X T e. Then 



?a|| 2 <7|Q|75 < 

On O n °n 



< 



fc n maxi< i < p „ 7 
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It follows that 



max|/fe | < ~ J - Pn 3 < Vfc n loglogn max 

since k > 0. So, 

p(max|/3 ? | > - ffC " = ,DcS n ) <P( max | 7 -| > ^ __ 
Yj Gj 4 a»j yreloglogn / V 1 ^.?^? 

n log log ^ V 

Note that 7 j ~ N(0,a 2 /n), and hence, 



E(max|7j| J < 



/ 2a 2 log(2p n ) 
\ 3 " J 7 ~ I n 
There exists e n — > such that P(i? n ) — > 1 where i? n = {(1 — e n ) < ct/ct < 
(1+e)}. So, 

max | 7 | > dCn ) < F ( max U.| > <7C " (1 ~!2j_ , J B T 

i<j<Pn log log n v n/c n / \i<j<Pn loglognvn/c n 

— ~ ; % - IN. max|7 ? - 

ct(1 - e n )c n vloglogn \ i 

<a + o(l). □ 



8. Discussion. The multi-stage method presented in this paper success- 
fully controls type I error while giving reasonable power. The lasso and step- 
wise have similar performance. Although theoretical results assume indepen- 
dent data for each of the three stages, simulations suggest that leave-one-out 
cross-validation leads to valid type I error rates and greater power. Screening 
the data in one phase of the experiment and cleaning in a followup phase 
leads to an efficient experimental design. Certainly this approach deserves 
further theoretical investigation. In particular, the question of optimality is 
an open question. 

The literature on high-dimensional variable selection is growing quickly. 
The most important deficiency in much of this work, including this paper, 
is the assumption that the model Y = X T (3 + e is correct. In reality, the 
model is at best an approximation. It is possible to study linear procedures 
when the linear model is not assumed to hold as in Greenshtein and Ritov 
(2004). We discuss this point in the Appendix. Nevertheless, it seems useful 
to study the problem under the assumption of linearity to gain insight into 
these methods. Future work should be directed at exploring the robustness 
of the results when the model is wrong. 

Other possible extensions include: dropping the normality of the errors, 
permitting nonconstant variance, investigating the optimal sample sizes for 
each stage and considering other screening methods besides cross-validation. 
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Finally, let us note that the example involving unfaithfulness, that is, 
cancellations of parameters to make the marginal correlation much different 
than the regression coefficient, pose a challenge for all the methods and 
deserve more attention even in cases of small p. 

APPENDIX: PREDICTION 

Realistically, there is little reason to believe that the linear model is cor- 
rect. Even if we drop the assumption that the linear model is correct, sparse 
methods like the lasso can still have good properties as shown in Greenshtein 
and Ritov (2004). In particular, they showed that the lasso satisfies a risk 
consistency property. In this appendix we show that this property continues 
to hold if A is chosen by cross-validation. 

The lasso estimator is the minimizer of Ya=i0^ ~ Xj f3) 2 + \\\(3\\\. This 
is equivalent to minimizing Ya=i(Xi ~ X-Jfl) 2 subject to \\(3\\i < fi, for some 
f2. (More precisely, the set of estimators as A varies is the same as the set of 
estimators as f2 varies.) We use this second version throughout this section. 

The predictive risk of a linear predictor £(x) = x T j3 is R(P) = K(Y — £{x)) 2 
where (X, Y) denotes a new observation. Let 7 = 7(/3) = (— 1, /3\, . . . , (5 P ) T 
and let T = E(ZZ T ) where Z = (Y,X 1 , . . . ,X p ). Then, we can write R{(3) = 
7 T r7. The lasso estimator can now be written as (3(Q n ) = ^rS m ^ n /3eB(n n ) R(P) 
where R(/3) = 7 T f 7 and f = n~ l £? =1 Ztf? . 

Define 

/3* = argmini?(/?), 

where 

B(n n ) = {(3:\\(3\\ 1 <n n }. 

Thus, £*(x) = x T (3* is the best linear predictor in the set B{Vt n ). The best 
linear predictor is well defined even though E(y|X) is no longer assumed to 
be linear. Greenshtein and Ritov (2004) call an estimator n persistent, or 
predictive risk consistent, if 

R0 n )-R((3*)^O 

as n — > 00. 

The assumptions we make in this section are: 
(Bl) p n < e n( , for some < £ < 1; 

(B2) the elements of T satisfy an exponential inequality 

n|f jfc -r jfc |> e )< C3 e-™ C4e2 

for some C3, C4 > 0; 
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(B3) there exists Bq <oo such that, for all n, maXjkM(\ZjZk\) < .Bo- 
Condition (A2) can easily be deduced from more primitive assumptions 

as in Greenshtein and Ritov (2004), but for simplicity we take (A2) as an 

assumption. Let us review one of the results in Greenshtein and Ritov (2004). 

For the moment, replace (Al) with the assumption that p n < n b , for some 

b. Under these conditions, it follows that 



logn 



A n = max Tj fe - T jk \ = O p 

3,k VV n 



Hence, 



sup \R(J3) -R(P)\ = sup |7 T (r-r) 7 | 

<A n sup || 7 ||2 = ^O p | 



'logn 



n 



The latter term is op(l) as long as Q n = o((n/ logn) 1 / 4 ). Thus, we have the 
following. 

Theorem A.l [Greenshtein and Ritov (2004)]. IfQ n = o((n/ logn) 1 / 4 ), 
then the lasso estimator is persistent. 

For future reference, let us state a slightly different version of their result 
that we will need. We omit the proof. 

Theorem A. 2. Let-f>0 be such that £+7 < 1- Let Q n = O^ 1 -^)/^ 
Then, under (Bl) and (B2), 



(20) P( sup \R(p)- Rm> A-\= (e-^ /2 ) 

for some c> 0. 



The estimator /3(£l n ) lies on the boundary of the ball B(Q n ) and is very 
sensitive to the exact choice of f2 n . A potential improvement — and something 
that reflects actual practice — is to compute the set of lasso estimators /3(£), 
for < £ < £l n and then select from that set based on cross validation. We 
now confirm that the resulting estimator preserves persistence. As before, 
we split the data into T>\ and T>2- Construct the lasso estimators {/?(£): < 
£ < f^n}- Choose £ by cross validation using T>2- Let (5 = (3(£). 
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Theorem A. 3. Let 7 > be such that £ + 7 < 1. Under (Al), (A2) 

and (A3), if n n = O^ 1 "^/ 4 ), i/ien tfie cross-validated lasso estimator (5 
is persistent. Moreover, 

(21) #(£)- inf R0(£))Zo. 

Proof. Let = argming eBW i?(/3). Define h(£) = R(J3*(£)), g(£) = 

R(/3(£)) and c(£) = L(f3{£)). Note that, for any vector 6, we can write Rib) = 
t 2 + 6 T S6 - 2b T p where p = (E(FXi), . . .,E(FX p )) T . 

Clearly, /i is monotone nonincreasing on [0,f2 n ]. We claim that \h{£ + 
5) — /i(^)| < cVL n 5 where c depends only on T. To see this, let u = (3*(£), 
v = p^(£ + S) and a = £$*{£ + 5)/(£ + 5) so that a G £(£). Then, 

/i(^ + <5) </i(i) 

= J2(u) < J?(a) 

= R(v) + R(a) - R(y) 

L /, « 2 ^ r 5(2^ + 5) ji 

< h(i + <5) + 25C + <5(20 n , + £)C, 

where C = max^fc \^j,k\ = 0(1). 

Next, we claim that g(£) is Lipschitz on [0, £l n ] with probability tending to 
1. Let (3{£) = argming GB (£) R(f5) denote the lasso estimator and set u = f3(£) 

and v = P(£ + 5). Let e n = n _7//4 . From (20), the following chain of equations 
hold except on a set of exponentially small probability: 

g (£ + S) = R(v) < R(v) + e n < R(v) + e n 

< R{v) + 2e n = h{£ + 5) + 2e n 

< h{£) + cQ n 5 + 2e n = R(u) + cQ n 5 + 2e n 

< R(u) + cn n 6 + 2e n = g{£) + cQ, n 5 + 2e n . 

A similar argument can be applied in the other direction. Conclude that 

\g(£ + S)-g(£)\<cfl n 6 + 2e n 

except on a set of small probability. 

Now let A = {0,5,25, . . . ,m5} where m is the smallest integer such that 
mS > U n . Thus, m ~ n n /S n . Choose 5 = 5 n = n - ^ 1 -*-?)/ 8 . Then, n n 5 n 
and £l n /5 n < n 3 ^ 1 "^" 7 ^ 4 . Using the same argument as in the proof of 
Theorem 3.2, 

m^x\L0{l))-R0{£))\=a n , 
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where a n = op(l). Then, 

R(P*(n n ))<R0)<L(P(£)) + a n 

< L(mS n ) + a n < g(mS n ) + 2a n < g(Q n ) + 2a n + c£l n 5 n 

< h(Q n ) + 2a n + e„ + cQ n 5 n 

= R(/3*(tl n )) + 2a - n + e n + cQ n 5 n 

and persistence follows. To show the second result, let (3 = argmino<^<o n g{i) 
and /? = argmin^gA^^)- Then, 

R0)<L0) + a n <L(p)+a n 

< R(J3) + 2a n < R0) + 2a n + c5 n n n 

and the claim follows. □ 
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